Abstract

We define the likelihood and give a number of justifications for its use as a skill measure for 
probabilistic forecasts. We describe a number of different scores based on the likelihood, and briefly 
investigate the relationships between the likelihood, the mean square error and the ignorance. 



1 Introduction 

Users of forecasts need to know: 

• whether the forecasts they are receiving have been adequately calibrated

• whether the forecasts they are receiving are any better than an appropriate simple model such as 
climatology 

• which of the forecasts they are receiving is the best 

To answer these questions, a single measure of forecast quality is needed. For calibration, the measure 
serves as a cost or benefit function that must be minimised or maximised in order to find the optimum
values for the free parameters in the calibration algorithm. For comparison with climatology or other 
forecasts, the measure serves as a way of deriving a ranking. 

There are many standard measures of forecast quality. For example, for calibrating and comparing single-
valued temperature forecasts, mean square error (MSE) is common. For binary probabilistic forecasts,
the Brier score (Brier (1950)) is often used. For continuous probability forecasts, the continuous rank
probability score and the ignorance have been suggested.

In this paper we will argue that likelihood-based measures provide a simple and natural general framework 
for the evaluation of all kinds of probabilistic forecast. For example, likelihood-based measures can be
used for binary and continuous probability forecasts, for temperature and precipitation, and for one lead 
time or many lead times simultaneously. 

In section 2 we define the likelihood and discuss why we think it is a useful measure of forecast skill. In
section 3 we include expressions for the likelihood for the normal distribution and in section 4 we discuss
relations between the likelihood and other forecast scoring methods. Finally in section 5 we summarise
and describe some areas of future work.



2 Probabilistic forecasts and the likelihood 

How should we evaluate the skill of a probabilistic forecast? We advocate the use of a particular set of 
measures that are taken from classical statistics, and are all based on the likelihood. Likelihood is defined 
very simply as the probability of the observations given the forecast. In this phrase the observations refers 
to the entire set of observations that we have available to validate a certain forecast, and the forecast 
refers to the entire set of corresponding forecasts.

Likelihood was first used by Fisher (1912) as a method for fitting parameters to parametric distributions.
Fisher proposed the likelihood as the natural benefit function that one should maximise in order to define 
the best-fit parameters of the distribution. This suggestion was given a mathematical basis when it was 
shown that the parameter values that maximise the likelihood are the most accurate possible estimates
for the unknown parameters for most problems (see Casella and Berger (2002)).

Fisher's problem, of how to evaluate the goodness of fit of a distribution to a number of samples, is 
exactly the same as the problem of how to evaluate a probabilistic forecast. Instead of the distribution 
we have the probabilistic forecast and instead of the samples we have the verifying observations. 

2.1 Advantages of the likelihood as a measure for skill 

We consider that the likelihood has the following advantages as a measure of probabilistic forecast skill: 

• It has a simple definition that, from a purely intuitive point of view, seems to be a reasonable basis 
on which to compare forecasts 

• It is mathematically optimal in the sense that estimates of parameters of calibration models fitted by
maximising the likelihood are usually the most accurate possible estimates (see Casella and Berger
(2002)).

• It is a generalisation to probabilistic forecasts of the most commonly used skill score for single 
forecasts: the RMSE (see section 4.1 below for a discussion of this).

• It also shows how the RMSE can be generalised to the case of autocorrelated forecast errors 

• The properties of the likelihood have been studied at great length over the last 90 years: it is well 
understood 

• It is a measure of both resolution and reliability

• Likelihood can be used for both calibration and assessment: this creates consistency between these 
two operations 

• Use of the likelihood also creates consistency with other statistical modelling activities, since most 
other statistical modelling uses the likelihood. This is important in cases where use of forecasts is 
simply a small part of a larger statistical modelling effort, as is the case for our particular business. 

• Likelihood can be used for all meteorological variables 

• Likelihood can be used to compare multiple leads, multiple variables and multiple locations at the 
same time in a sensible way (giving a single score) even when these leads, variables and locations 
are cross-correlated 

2.2 Forecast scores derived from the likelihood 

A number of different scores can be derived from the likelihood. 

• The log-likelihood (LL) reduces the range of values of the likelihood to a more manageable scale 

• Minus the LL (MLL) has the characteristic that better forecasts have lower values: in this way it 
is analogous to the MSE 

• The square root of the MLL (RMLL) has a further compressed scale 

• All these measures can be transformed into skill scores from zero to one in the usual way 

Other transformations are also possible: for instance, one might consider normalising by the number of 
data points. 
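
As an illustration of the scores listed above, the following is a minimal sketch for binary probability forecasts. The function names, the example probabilities, the 0.5 reference forecast and the skill score form 1 - MLL/MLL_ref are assumptions made for this example only, and forecast probabilities are assumed to lie strictly between zero and one. Because the probabilities are discrete, the MLL is non-negative and its square root is defined; for continuous densities the MLL can be negative, in which case the RMLL would need separate treatment.

```python
import math

def binary_likelihood_scores(probs, outcomes):
    """Return (LL, MLL, RMLL) for binary probability forecasts.

    probs[i] is the forecast probability of the event (assumed in (0, 1));
    outcomes[i] is 1 if the event occurred and 0 otherwise.
    """
    # Probability that each forecast assigned to what actually happened.
    p_obs = [p if o == 1 else 1.0 - p for p, o in zip(probs, outcomes)]
    ll = sum(math.log(p) for p in p_obs)  # log-likelihood (LL)
    mll = -ll                             # minus log-likelihood (MLL)
    rmll = math.sqrt(mll)                 # root of MLL; valid because mll >= 0 here
    return ll, mll, rmll

def skill_score(mll, mll_ref):
    """Hypothetical skill score: 1 for a perfect forecast, 0 for the reference."""
    return 1.0 - mll / mll_ref

# Illustrative data (not taken from the paper).
probs = [0.9, 0.2, 0.7, 0.1]
outcomes = [1, 0, 1, 0]

ll, mll, rmll = binary_likelihood_scores(probs, outcomes)

# Reference forecast: a constant probability of 0.5, standing in for climatology.
_, mll_clim, _ = binary_likelihood_scores([0.5] * len(outcomes), outcomes)
ss = skill_score(mll, mll_clim)

# Normalising by the number of data points, as mentioned in the text.
mll_per_obs = mll / len(outcomes)
```

With this definition a forecast that is worse than the reference receives a negative skill score, so the range from zero to one applies only to forecasts at least as good as the reference.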

3 The likelihood for the normal distribution 

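For a normal distribution the likelihood is given by:

$$L = \frac{1}{\sqrt{(2\pi)^N \det\Sigma}} \exp\left(-\frac{1}{2}(T-\mu)^{\top}\Sigma^{-1}(T-\mu)\right) \qquad (1)$$

where $T$ is the vector of observations, $\mu$ is the vector of means from the forecast, $\Sigma$ is the covariance
matrix of the forecast errors, $\det\Sigma$ is the determinant of $\Sigma$, and $N$ is the number of observations.

The log-likelihood is then:

$$l = -\frac{1}{2}\ln\left((2\pi)^N \det\Sigma\right) - \frac{1}{2}(T-\mu)^{\top}\Sigma^{-1}(T-\mu) \qquad (2)$$

In the case where the forecast errors can be assumed to be uncorrelated in time, $\Sigma$ is diagonal with
entries $\sigma_i^2$ and the likelihood becomes:

$$L = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(T_i-\mu_i)^2}{2\sigma_i^2}\right) \qquad (3)$$

and the log-likelihood is:

$$l = -\frac{N}{2}\ln(2\pi) - \sum_{j=1}^{N}\ln\sigma_j - \sum_{i=1}^{N}\frac{(T_i-\mu_i)^2}{2\sigma_i^2} \qquad (4)$$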

When evaluating a forecast using the likelihood, calculating the covariance matrix is straightforward 
because the forecast errors are known. When calibrating a forecast using the likelihood, calculating the 
covariance matrix is more difficult. If it is reasonable to assume that the errors are uncorrelated in time, 
then this simplifies the calibration considerably. However, this is generally not the case. 
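
The two cases described in this section can be sketched numerically as follows. This is an illustration under stated assumptions rather than the method used in the paper: it uses the standard multivariate normal normalisation with a factor of $(2\pi)^{N/2}$, the function names and data are invented for the example, and the AR(1)-style covariance with correlation 0.6 is a hypothetical stand-in for autocorrelated forecast errors.

```python
import numpy as np

def normal_log_likelihood(obs, mean, cov):
    """Log-likelihood of obs under a multivariate normal with full covariance."""
    resid = np.asarray(obs, dtype=float) - np.asarray(mean, dtype=float)
    n = resid.size
    # slogdet avoids overflow or underflow in the determinant for large n.
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance matrix must have a positive determinant")
    # Solve cov @ y = resid instead of forming the inverse explicitly.
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

def normal_log_likelihood_uncorrelated(obs, mean, sigma):
    """Log-likelihood when the forecast errors are taken to be uncorrelated."""
    resid = np.asarray(obs, dtype=float) - np.asarray(mean, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n = resid.size
    return (-0.5 * n * np.log(2.0 * np.pi)
            - np.sum(np.log(sigma))
            - np.sum(resid**2 / (2.0 * sigma**2)))

# Illustrative data (not taken from the paper).
obs = np.array([12.1, 13.4, 11.8, 14.0])
mean = np.array([12.0, 13.0, 12.5, 13.5])
sigma = np.array([0.8, 0.9, 1.0, 1.1])

# With a diagonal covariance the two functions agree.
ll_uncorr = normal_log_likelihood_uncorrelated(obs, mean, sigma)
ll_diag = normal_log_likelihood(obs, mean, np.diag(sigma**2))

# Hypothetical autocorrelated errors: correlation rho**|i-j| between times i and j.
rho = 0.6
lags = np.abs(np.subtract.outer(np.arange(obs.size), np.arange(obs.size)))
cov_ar1 = np.outer(sigma, sigma) * rho**lags
ll_ar1 = normal_log_likelihood(obs, mean, cov_ar1)
```

In general `ll_ar1` differs from `ll_uncorr`, which is why the choice of covariance model matters when the errors are correlated.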



4 Relations between the likelihood and other skill scores 

Likelihood is closely related to the RMSE and the ignorance, as we see below.

4.1 Relation between the likelihood and RMSE

We show that the RMSE and the likelihood are consistent (i.e. give the same ranking of forecasts) in 
the case of two normally distributed probabilistic forecasts with different means but the same constant 
spreads. Likelihood is used to compare the whole distribution, while RMSE is used to compare the means. 
Suppose we have two forecasts, A and B, and suppose: 

$$L_A > L_B \qquad (5)$$

Taking logs, this gives: 

$$l_A > l_B \qquad (6)$$

Substituting in the expression for the log-likelihood for a normal distribution we see that:

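$$-\frac{N}{2}\ln(2\pi) - N\ln(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - f_{A,i})^2 > -\frac{N}{2}\ln(2\pi) - N\ln(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - f_{B,i})^2 \qquad (7)$$

where $N$ is the number of observations, $\sigma$ is the common constant spread, $f_{A,i}$ and $f_{B,i}$ are the
time-varying forecasts, and $x_i$ are the time-varying observations.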

Cancelling terms from both sides: 

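$$-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - f_{A,i})^2 > -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - f_{B,i})^2 \qquad (8)$$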

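Cancelling the remaining common factor (a negative constant, which reverses the inequality), this gives:

$$\sum_{i=1}^{N}(x_i - f_{A,i})^2 < \sum_{i=1}^{N}(x_i - f_{B,i})^2 \qquad (9)$$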

or 

$$\mathrm{MSE}_A < \mathrm{MSE}_B \qquad (10)$$

and so we see that comparing these forecasts using likelihood or MSE gives the same results, i.e. that
forecast A is better than forecast B. 
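
The argument above can be checked numerically. The sketch below uses synthetic data and assumes uncorrelated normal forecast errors with a common constant spread; the sample size, the noise levels and the spread value are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic observations and two forecasts of the mean with different error sizes.
x = rng.normal(size=n)
f_a = x + rng.normal(scale=0.5, size=n)
f_b = x + rng.normal(scale=1.0, size=n)

# Both probabilistic forecasts use the same constant spread (an assumed value).
sigma = 0.8

def log_likelihood(x, f, sigma):
    """Normal log-likelihood with uncorrelated errors and constant spread sigma."""
    n = x.size
    return (-0.5 * n * np.log(2.0 * np.pi)
            - n * np.log(sigma)
            - np.sum((x - f) ** 2) / (2.0 * sigma**2))

def mse(x, f):
    """Mean square error of the forecast means."""
    return np.mean((x - f) ** 2)

ll_a, ll_b = log_likelihood(x, f_a, sigma), log_likelihood(x, f_b, sigma)
mse_a, mse_b = mse(x, f_a), mse(x, f_b)

# The higher log-likelihood belongs to the forecast with the lower MSE.
same_ranking = (ll_a > ll_b) == (mse_a < mse_b)

# The log-likelihood difference is a fixed multiple of the MSE difference.
expected_diff = n * (mse_b - mse_a) / (2.0 * sigma**2)
```

The relation `ll_a - ll_b == expected_diff` holds for any choice of `sigma`, so the ranking does not depend on the value of the common spread, only on it being shared by both forecasts.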



4.2 Relationship between the likelihood and ignorance

Roulston and Smith (2002) describe a score for the assessment of probabilistic forecasts that they call
the ignorance, and justify its usage on the basis of information theory and use in an optimal betting 
strategy. They define the ignorance for a single forecast-observation pair as minus the log (base 2) of the 
probability of the observation given the probabilistic forecast. We see that this is equivalent to minus log 
(base 2) of the likelihood for that single forecast-observation pair. 

Comparing forecasts using the ignorance or any of the likelihood-based scores described above will give 
the same results if the forecast errors are uncorrelated in time. If the errors are correlated in time, and
this is taken into account in the calculation of the likelihood, then they may give differing results.

One can consider the likelihood as a generalisation of the ignorance to a) forecasts with autocorrelated
forecast errors and b) forecasts for many variables, locations or leads at once. One can consider the 
ignorance as a special case of the likelihood when forecast errors are taken to be uncorrelated, and when 
looking at only a single variable, location and lead. 
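
The equivalence in the uncorrelated case can be illustrated with a short sketch. It assumes normal forecast densities and sums the ignorance over forecast-observation pairs; the data and function names are invented for the example. Under these assumptions the total ignorance equals the minus log-likelihood divided by ln 2, so the two scores rank forecasts identically.

```python
import math

def normal_density(x, mean, sigma):
    """Density of a normal forecast distribution evaluated at the observation x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Illustrative forecast-observation pairs (not taken from the paper).
obs = [12.1, 13.4, 11.8]
means = [12.0, 13.0, 12.5]
sigmas = [0.8, 0.9, 1.0]

dens = [normal_density(x, m, s) for x, m, s in zip(obs, means, sigmas)]

# Ignorance: minus log base 2 of the probability of each observation, summed.
ignorance_total = sum(-math.log2(p) for p in dens)

# Minus log-likelihood with errors treated as uncorrelated (natural log).
mll = -sum(math.log(p) for p in dens)

# The two differ only by the constant factor ln 2.
ratio = mll / ignorance_total
```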

5 Summary 

We have summarised the use of the likelihood for the evaluation of the skill of probabilistic forecasts. 
We believe that likelihood provides a useful general framework for the calibration and evaluation of all 
probabilistic forecasts, for all variables. We are in the process of applying the likelihood to various 
forecasting situations that are relevant to our business: examples are given in Jewson et al. (2003a)
and Jewson et al. (2003b).

A number of questions arise that merit further investigation. These include:

• When calibrating forecasts to maximise the likelihood, what numerical methods can be used to 
estimate the forecast error covariance matrix? 

• Is it really necessary to calculate the likelihood using the correct forecast error covariance matrix, 
or is it satisfactory in practice to make the assumption that forecast errors are uncorrelated? One 
can argue that if the covariance matrix is not correctly modelled, then forecasts with autocorrelated 
errors are given more credit than is their due. However, it may be that in practice the ranking of 
forecasts is the same whether or not the covariance is estimated accurately. 

• What are the relationships, if any, between the likelihood and other skill scores apart from those 
discussed above? 